Learning (k, l)-Contextual Tree Languages for Information Extraction
Authors
Stefan Raeymaekers, Maurice Bruynooghe, and Jan Van den Bussche
1 K.U.Leuven, Dept. of Computer Science, Celestijnenlaan 200A, B-3001 Leuven, {stefanr,maurice}@cs.kuleuven.ac.be
2 University of Limburg, Dept. WNI, Universitaire Campus, B-3590 Diepenbeek, [email protected]
Abstract
Learning regular languages from positive examples only is known to be infeasible. A common solution is to define a learnable subclass of the regular languages. In the past, this has been done for regular string languages. Using ideas from those techniques, we define a learnable subclass of regular unranked tree languages, called the (k,l)-contextual tree languages. We describe the use of this subclass to induce wrappers for Information Extraction from structured documents, such as web pages. Experiments show that our algorithm is able to learn from very little data and compares favorably to similar state-of-the-art approaches.
Similar resources
Wrapper Induction: Learning (k,l)-Contextual Tree Languages Directly as Unranked Tree Automata
A (k, l)-contextual tree language can be learned from positive examples only; such languages have been successfully used as wrappers for information extraction from web pages. This paper shows how to represent the wrapper as an unranked tree automaton and how to construct it directly from the examples instead of using the (k, l)-forks of the examples. The former speeds up the extraction, the la...
Parameterless Information Extraction Using (k,l)-Contextual Tree Languages
Recently, several wrapper induction algorithms for structured documents have been introduced. They are based on contextual tree languages and learn from positive examples only but have the disadvantage that they need parameters. To obtain the optimal parameter setting, they use precision and recall. This goes in fact beyond learning from positive examples only. In this paper, a parameter estima...
Information extraction from structured documents using k-testable tree automaton inference
Information extraction (IE) addresses the problem of extracting specific information from a collection of documents. Much of the previous work on IE from structured documents, such as HTML or XML, uses learning techniques that are based on strings, such as finite automata induction. These methods do not exploit the tree structure of the documents. A natural way to do this is to induce tree auto...
Unsupervised Learning of Contextual Role Knowledge for Coreference Resolution
We present a coreference resolver called BABAR that uses contextual role knowledge to evaluate possible antecedents for an anaphor. BABAR uses information extraction patterns to identify contextual roles and creates four contextual role knowledge sources using unsupervised learning. These knowledge sources determine whether the contexts surrounding an anaphor and antecedent are compatible. BABA...
NLP Techniques for Term Extraction and Ontology Population
This chapter investigates NLP techniques for ontology population, using a combination of rule-based approaches and machine learning. We describe a method for term recognition using linguistic and statistical techniques, making use of contextual information to bootstrap learning. We then investigate how term recognition techniques can be useful for the wider task of information extraction, makin...
Journal title:
Volume, issue:
Pages: -
Publication date: 2005